Multinomial Model

Regression
GLM
Classification
Modeling the counts of outcomes across multiple categorical trials.

General Principles

To model the relationship between one or more independent variables and a vector outcome whose elements are frequencies across more than two categories, we can use a multinomial model.
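To make the data structure concrete, here is a minimal, self-contained sketch (plain NumPy with made-up numbers, independent of the BI package) that simulates this kind of outcome: each individual chooses one of K = 3 categories, and the modeled outcome is the vector of counts per category.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: three careers, each with a different expected income
income = np.array([1.0, 2.0, 5.0])
beta = 0.5                                # made-up effect of income on choice
score = beta * income                     # linear predictor per category
p = np.exp(score) / np.exp(score).sum()   # softmax: probabilities sum to one

# 500 individuals each pick one category; the outcome is the count per category
career = rng.choice(3, size=500, p=p)
counts = np.bincount(career, minlength=3)
print("probs:", p.round(3), "counts:", counts)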

Example

Below are example code snippets, in Python, R, and Julia, demonstrating a Bayesian multinomial model using the Bayesian Inference (BI) package. The example is based on McElreath (2018).

Python
from BI import bi, jnp
import jax
# Setup device ------------------------------------------------
m = bi('cpu')

# Import Data & Data Manipulation ------------------------------------------------
# Import
data_path = m.load.sim_multinomial(only_path=True)
m.data(data_path, sep=',') 

# Define model ------------------------------------------------
def model(income, career):
    # Parameter prior distributions
    alpha = m.dist.normal(0, 1, shape=(2,), name='a')
    beta = m.dist.half_normal(0.5, shape=(1,), name='b')
    # Linear score for each category; the last category is the reference,
    # pinned at zero for identifiability
    s_1 = alpha[0] + beta * income[0]
    s_2 = alpha[1] + beta * income[1]
    s_3 = jnp.array([0.0])
    # Softmax maps the scores to probabilities that sum to one
    p = jax.nn.softmax(jnp.stack([s_1[0], s_2[0], s_3[0]]))
    # Likelihood
    m.dist.multinomial(probs=p[career], obs=career)
    
# Run sampler ------------------------------------------------ 
m.fit(model)  

# Summary ------------------------------------------------
m.summary()
       mean    sd  hdi_5.5%  hdi_94.5%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
a[0]   0.00  0.97     -1.60       1.51       0.05     0.04    428.89    395.09    NaN
a[1]  82.06  1.02     80.40      83.58       0.05     0.05    472.75    340.24    NaN
b[0]  40.96  0.50     40.12      41.65       0.02     0.02    616.30    368.44    NaN
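Note that r_hat is reported as NaN because this example ran a single chain; ArviZ can only compute r_hat when at least two chains are available. For a real analysis, run multiple chains so convergence can be checked.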
R

library(BayesianInference)
jax = reticulate::import('jax')
jnp = reticulate::import('jax.numpy')
# Setup platform------------------------------------------------
m=importBI(platform='cpu')

# import data ------------------------------------------------
m$data(m$load$sim_multinomial(only_path=TRUE), sep=',')
keys <- c("income", "career")
income = unique(m$df$income)
income = income[order(income)]
values <- list(jnp$array(as.integer(income)), jnp$array(as.integer(m$df$career)))
m$data_on_model = reticulate::py_dict(keys, values, convert = TRUE)

# Define model ------------------------------------------------
model <- function(income, career){
  # Parameter prior distributions
  alpha = m$dist$normal(0, 1, name='alpha', shape=c(2))
  beta = m$dist$half_normal(0.5, name='beta')
  
  s_1 = alpha[0] + beta * income[0]
  s_2 = alpha[1] + beta * income[1]
  s_3 = 0 # reference category
  
  p = jax$nn$softmax(jnp$stack(list(s_1, s_2, s_3)))
  
  # Likelihood
  m$dist$multinomial(probs=p[career], obs=career)
}


# Run sampler ------------------------------------------------ 
m$fit(model)  

# Summary ------------------------------------------------
m$summary()
Julia

using BayesianInference

# Setup device------------------------------------------------
m = importBI(platform="cpu")

# Import Data & Data Manipulation ------------------------------------------------
# Import
data_path = m.load.sim_multinomial(only_path = true)
m.data(data_path, sep=',')

# Define model ------------------------------------------------
@BI function model(income, career)
    # Parameter prior distributions
    alpha = m.dist.normal(0, 1, shape=(2,), name="a")
    beta = m.dist.half_normal(0.5, shape=(1,), name="b")
    s_1 = alpha[0] + beta * income[0]
    s_2 = alpha[1] + beta * income[1]
    # ⚠️  Use jnp.array to create a Python object, so [0] indexing works
    s_3 = jnp.array([0.0])  # reference category, pinned at zero
    phi = jnp.stack([s_1[0], s_2[0], s_3[0]])
    # Softmax (exponentiate, then normalize) so the probabilities sum to one
    p = jnp.exp(phi) / jnp.sum(jnp.exp(phi))
    # Likelihood
    m.dist.multinomial(probs=p[career], obs=career)
end

# Run mcmc ------------------------------------------------
m.fit(model)  # Optimize model parameters through MCMC sampling

# Summary ------------------------------------------------
m.summary() # Get posterior distributions

Mathematical Details

We can model a vector of frequencies across K categories using a multinomial distribution. For an outcome variable Y_i with K categories, the multinomial likelihood and linear model are:

\begin{aligned}
Y_i &\sim \text{Multinomial}(\theta_i) \\
\theta_i &= \text{Softmax}(\phi_i) \\
\phi_{[i,1]} &= \alpha_1 + \beta_1 X_i \\
\phi_{[i,2]} &= \alpha_2 + \beta_2 X_i \\
&\ \ \vdots \\
\phi_{[i,K]} &= 0 \\
\alpha_k &\sim \text{Normal}(0, 1) \\
\beta_k &\sim \text{Normal}(0, 1)
\end{aligned}

Where:

  • Y_i is the outcome for observation i, i.e. the vector of frequencies across the K categories.

  • \theta_i is a probability vector, unique to each observation i, giving the probability that observation i falls in each of the K categories.

  • \phi_i gives the linear predictor for each of the K categories. Note that we use the softmax function to ensure that the probabilities \theta_i form a simplex.

  • Each element of \phi_i is obtained from a linear regression model with its own intercept \alpha_k and slope coefficient \beta_k. To ensure the model is identifiable, one category, K, is arbitrarily chosen as a reference or baseline category, and its linear predictor is set to zero. The coefficients for the other categories then represent the change in the log-odds of being in that category versus the reference category, as the short check below illustrates.
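As a quick sanity check of this identifiability argument, the short JAX snippet below (illustrative only, with made-up scores) shows that the softmax of the scores is a simplex, and that shifting every score by the same constant leaves the probabilities unchanged, which is why the reference category can be pinned at zero without loss of generality.

import jax.numpy as jnp
from jax.nn import softmax

s = jnp.array([0.8, -0.3, 0.0])  # scores phi; the last is the reference, pinned at 0
print(softmax(s))                # non-negative and sums to one: a point on the simplex
print(softmax(s + 2.0))          # same probabilities: only score differences matter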

Reference(s)

McElreath, Richard. 2018. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Chapman and Hall/CRC.